X2013_08_01T20_31_13_000Z <- read_csv("~/Desktop/USC/Master/Fall Semester/Introduction to Health Data Science/Midterm Project/Possible Data/2013-08-01T20_31_13.000Z.csv")
## Rows: 876 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): State, Location 1
## dbl (5): Year, Smoke everyday, Smoke some days, Former smoker, Never smoked
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
smoke<- X2013_08_01T20_31_13_000Z
First, I loaded the data, which has 7 variables: Year, State, Smoke everyday, Smoke some days, Former smoker, Never smoked, and Location 1. There are 876 observations and 56 distinct values of State: the 50 states, the District of Columbia, three territories (Guam, Puerto Rico, and the Virgin Islands), and two nationwide aggregates.
Most of these have 16 observations, one for each year from 1995 to 2010. The exceptions are the District of Columbia, Hawaii, and Puerto Rico (15 each), Utah (14), the Virgin Islands (10), and Guam (7). Each year has between 51 and 56 entries. Because only a small amount of data is missing, it should not materially affect the study.
Then I used any(is.na()) to check for NA values in the four columns Smoke everyday, Smoke some days, Former smoker, and Never smoked. There were none.
dim(smoke)
## [1] 876 7
table(smoke$Year)
##
## 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
## 51 53 54 54 54 54 56 56 56 54 55 55 56 56 56 56
table(smoke$State)
##
## Alabama
## 16
## Alaska
## 16
## Arizona
## 16
## Arkansas
## 16
## California
## 16
## Colorado
## 16
## Connecticut
## 16
## Delaware
## 16
## District of Columbia
## 15
## Florida
## 16
## Georgia
## 16
## Guam
## 7
## Hawaii
## 15
## Idaho
## 16
## Illinois
## 16
## Indiana
## 16
## Iowa
## 16
## Kansas
## 16
## Kentucky
## 16
## Louisiana
## 16
## Maine
## 16
## Maryland
## 16
## Massachusetts
## 16
## Michigan
## 16
## Minnesota
## 16
## Mississippi
## 16
## Missouri
## 16
## Montana
## 16
## Nationwide (States and DC)
## 16
## Nationwide (States, DC, and Territories)
## 16
## Nebraska
## 16
## Nevada
## 16
## New Hampshire
## 16
## New Jersey
## 16
## New Mexico
## 16
## New York
## 16
## North Carolina
## 16
## North Dakota
## 16
## Ohio
## 16
## Oklahoma
## 16
## Oregon
## 16
## Pennsylvania
## 16
## Puerto Rico
## 15
## Rhode Island
## 16
## South Carolina
## 16
## South Dakota
## 16
## Tennessee
## 16
## Texas
## 16
## Utah
## 14
## Vermont
## 16
## Virgin Islands
## 10
## Virginia
## 16
## Washington
## 16
## West Virginia
## 16
## Wisconsin
## 16
## Wyoming
## 16
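The entries with fewer than 16 rows can also be picked out programmatically instead of being read off the table. This is a minimal sketch on a small stand-in data frame, `smoke_demo`, whose years are invented for illustration. In the analysis the input would be `smoke`.

```r
# Hypothetical stand-in for `smoke`: one row per State-Year combination.
smoke_demo <- data.frame(
  State = c(rep("Alabama", 16), rep("Utah", 14), rep("Guam", 7)),
  Year  = c(1995:2010, 1997:2010, 2004:2010)
)

n_years <- length(unique(smoke_demo$Year))  # number of survey years present
state_counts <- table(smoke_demo$State)

# Keep only the entries that are missing at least one year.
incomplete <- state_counts[state_counts < n_years]
print(incomplete)
```

Applied to the real data, this should return the six entries with fewer than 16 rows in the table above: District of Columbia, Guam, Hawaii, Puerto Rico, Utah, and Virgin Islands.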
any(is.na(smoke$`Smoke everyday`))
## [1] FALSE
any(is.na(smoke$`Smoke some days`))
## [1] FALSE
any(is.na(smoke$`Former smoker`))
## [1] FALSE
any(is.na(smoke$`Never smoked`))
## [1] FALSE
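The four separate NA checks can be collapsed into a single call. The sketch below assumes a small stand-in data frame, `smoke_demo`, with invented values and the same four column names. It is only one possible way to write the check.

```r
# Hypothetical stand-in for `smoke` with the same four column names.
smoke_demo <- data.frame(
  `Smoke everyday`  = c(20.1, 18.4, 17.9),
  `Smoke some days` = c(4.8, 5.1, 4.9),
  `Former smoker`   = c(24.0, 25.2, 24.7),
  `Never smoked`    = c(51.1, 51.3, 52.5),
  check.names = FALSE  # keep the spaces in the column names
)

smoke_cols <- c("Smoke everyday", "Smoke some days", "Former smoker", "Never smoked")

# Count missing values per column; all zeros means there is no NA anywhere.
na_counts <- colSums(is.na(smoke_demo[smoke_cols]))
print(na_counts)
```

Unlike any(is.na()), this also reports how many values are missing in each column, which is useful if a check ever fails.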
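According to skim(), the four variables (Smoke everyday, Smoke some days, Former smoker, Never smoked) are roughly bell-shaped, judging by the histograms, although Never smoked has a long right tail (a maximum of 83.7 against a median of 53.5). Their means sum to about 100, which suggests the values are percentages of respondents rather than counts.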
library(skimr)
skim(smoke)
| Name | smoke |
| Number of rows | 876 |
| Number of columns | 7 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| State | 0 | 1.00 | 4 | 40 | 0 | 56 | 0 |
| Location 1 | 37 | 0.96 | 11 | 60 | 0 | 54 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Year | 0 | 1 | 2002.59 | 4.59 | 1995.0 | 1999.00 | 2003.0 | 2007.00 | 2010.0 | ▇▆▆▆▆ |
| Smoke everyday | 0 | 1 | 16.56 | 3.98 | 3.6 | 13.90 | 16.7 | 19.10 | 29.1 | ▁▃▇▃▁ |
| Smoke some days | 0 | 1 | 4.84 | 1.16 | 1.3 | 4.20 | 4.9 | 5.53 | 8.5 | ▁▂▇▃▁ |
| Former smoker | 0 | 1 | 24.32 | 3.50 | 9.9 | 22.90 | 24.5 | 26.20 | 33.4 | ▁▁▅▇▂ |
| Never smoked | 0 | 1 | 54.26 | 5.60 | 39.5 | 51.08 | 53.5 | 56.20 | 83.7 | ▁▇▂▁▁ |
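A rough symmetry check can be made from the quantiles in the skim() table alone. The sketch below compares the distance from the median to the maximum with the distance from the median to the minimum. The name `tail_ratio` is introduced here for illustration, and because it relies on the extremes it is a crude indicator, not a formal normality test.

```r
# Quantiles copied from the skim() output above.
q <- data.frame(
  variable = c("Smoke everyday", "Smoke some days", "Former smoker", "Never smoked"),
  p0   = c(3.6, 1.3, 9.9, 39.5),
  p50  = c(16.7, 4.9, 24.5, 53.5),
  p100 = c(29.1, 8.5, 33.4, 83.7)
)

# Ratio of upper range to lower range: about 1 for a symmetric distribution,
# above 1 for a longer right tail, below 1 for a longer left tail.
q$tail_ratio <- round((q$p100 - q$p50) / (q$p50 - q$p0), 2)
print(q[, c("variable", "tail_ratio")])
```

The ratio is close to 1 for the two current-smoker variables, below 1 for Former smoker, and above 2 for Never smoked, which matches the shapes of the histograms in the table.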
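library(dplyr)    # group_by(), summarize(), %>%
library(ggplot2)  # ggplot(), geom_point(), geom_smooth()
smoke_avg<-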
smoke%>%
group_by(Year)%>%
summarize(
S_everyday_avg=mean(`Smoke everyday`),
S_someday_avg=mean(`Smoke some days`),
S_former_avg=mean(`Former smoker`),
S_never_avg=mean(`Never smoked`)
)
smoke_avg%>%
ggplot(mapping = aes(x=Year,y=S_everyday_avg))+
geom_point()+
geom_smooth(method=lm,col="black")
## `geom_smooth()` using formula 'y ~ x'
smoke_avg%>%
ggplot(mapping = aes(x=Year,y=S_someday_avg))+
geom_point()+
geom_smooth(method=lm,col="black")
## `geom_smooth()` using formula 'y ~ x'
smoke_avg%>%
ggplot(mapping = aes(x=Year,y=S_former_avg))+
geom_point()+
geom_smooth(method=lm,col="black")
## `geom_smooth()` using formula 'y ~ x'
smoke_avg%>%
ggplot(mapping = aes(x=Year,y=S_never_avg))+
geom_point()+
geom_smooth(method=lm,col="black")
## `geom_smooth()` using formula 'y ~ x'
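The four nationwide plots share the same structure, so they could be drawn in one call by reshaping the yearly averages to long format and faceting by category. This is a minimal sketch, assuming a stand-in table `smoke_avg_demo` with invented values and the same column names as `smoke_avg`. The names `category` and `avg` are introduced here for illustration.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical stand-in for `smoke_avg` (values invented for illustration).
smoke_avg_demo <- tibble(
  Year           = 1995:1998,
  S_everyday_avg = c(19.5, 19.2, 18.8, 18.5),
  S_someday_avg  = c(4.1, 4.2, 4.4, 4.5),
  S_former_avg   = c(24.0, 24.1, 24.3, 24.2),
  S_never_avg    = c(52.4, 52.5, 52.5, 52.8)
)

# Reshape to long format: one row per Year and smoking category.
smoke_long <- smoke_avg_demo %>%
  pivot_longer(cols = ends_with("_avg"), names_to = "category", values_to = "avg")

# One panel per category, each with its own y scale and a linear trend line.
p <- ggplot(smoke_long, aes(x = Year, y = avg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, col = "black") +
  facet_wrap(~category, scales = "free_y")
print(p)
```

Free y scales keep the small Smoke some days values readable next to the much larger Never smoked values.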
In the regional plots below, all five regions show a decreasing trend in the share of people who smoke every day. For people who smoke some days, the trend turns into a slight increase. For former smokers, the Northeast, Midwest, Southeast, and Southwest all show a slight increase, while the West shows a slight decrease. For people who never smoked, the regions show an increasing trend.
smoke<-
smoke%>%
mutate(Geo_cate=case_when(
State == "Connecticut" ~ "Northeast",
State == "Maine" ~ "Northeast",
State == "Massachusetts" ~ "Northeast",
State == "New Hampshire" ~ "Northeast",
State == "Rhode Island" ~ "Northeast",
State == "Vermont" ~ "Northeast",
State == "New Jersey" ~ "Northeast",
State == "New York" ~ "Northeast",
State == "Delaware" ~ "Northeast",
State == "Pennsylvania" ~ "Northeast",
State == "Alabama" ~ "Southeast",
State == "Arkansas" ~ "Southeast",
State == "Florida" ~ "Southeast",
State == "Georgia" ~ "Southeast",
State == "Kentucky" ~ "Southeast",
State == "Louisiana" ~ "Southeast",
State == "Mississippi" ~ "Southeast",
State == "North Carolina" ~ "Southeast",
State == "South Carolina" ~ "Southeast",
State == "Tennessee" ~ "Southeast",
State == "Virginia" ~ "Southeast",
State == "West Virginia" ~ "Southeast",
State == "Arizona" ~ "Southwest",
State == "Colorado" ~ "Southwest",
State == "Utah" ~ "Southwest",
State == "Nevada" ~ "Southwest",
State == "New Mexico" ~ "Southwest",
State == "Idaho" ~ "West",
State == "Montana" ~ "West",
State == "Wyoming" ~ "West",
State == "California" ~ "West",
State == "Washington" ~ "West",
State == "Oregon" ~ "West",
State == "Hawaii" ~ "West",
State == "Oklahoma" ~ "Southwest",
State == "Texas" ~ "Southwest",
State == "Illinois" ~ "Midwest",
State == "Indiana" ~ "Midwest",
State == "Iowa" ~ "Midwest",
State == "Kansas" ~ "Midwest",
State == "Michigan" ~ "Midwest",
State == "Minnesota" ~ "Midwest",
State == "Missouri" ~ "Midwest",
State == "Nebraska" ~ "Midwest",
State == "North Dakota" ~ "Midwest",
State == "Ohio" ~ "Midwest",
State == "South Dakota" ~ "Midwest",
State == "Wisconsin" ~ "Midwest",
))
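case_when() returns NA for rows that match none of its conditions. In the mapping above, Alaska, Maryland, the District of Columbia, the three territories, and the two nationwide rows have no condition, so they drop out of every regional plot. The sketch below shows one way to spot such gaps with a named lookup vector. The abbreviated `region_lookup` and the `states` vector are hypothetical stand-ins, not the full mapping.

```r
# Hypothetical, abbreviated lookup: the full version would list every state.
region_lookup <- c(
  "Connecticut" = "Northeast",
  "Alabama"     = "Southeast",
  "Arizona"     = "Southwest",
  "California"  = "West",
  "Illinois"    = "Midwest"
)

# Stand-in for smoke$State.
states <- c("Connecticut", "Alabama", "Alaska", "Maryland", "Illinois")

# Indexing by name returns NA for any state missing from the lookup.
geo_cate <- unname(region_lookup[states])

# List the states that did not receive a region.
unassigned <- unique(states[is.na(geo_cate)])
print(unassigned)
```

Running the same check on the real `Geo_cate` column, for example with unique(smoke$State[is.na(smoke$Geo_cate)]), would list every entry left out of the regional analysis.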
#Smoke Everyday
smoke%>%
filter(Geo_cate=='Northeast')%>%
ggplot(mapping = aes(x=Year,y=`Smoke everyday`))+
geom_point()+
geom_smooth(method = lm,col='black')
## `geom_smooth()` using formula 'y ~ x'
smoke%>%
filter(Geo_cate=='Southwest')%>%
ggplot(mapping = aes(x=Year,y=`Smoke everyday`))+
geom_point()+
geom_smooth(method = lm,col='black')
## `geom_smooth()` using formula 'y ~ x'
smoke%>%
filter(Geo_cate=='West')%>%
ggplot(mapping = aes(x=Year,y=`Smoke everyday`))+
geom_point()+
geom_smooth(method = lm,col='black')
## `geom_smooth()` using formula 'y ~ x'
smoke%>%
filter(Geo_cate=='Southeast')%>%
ggplot(mapping = aes(x=Year,y=`Smoke everyday`))+
geom_point()+
geom_smooth(method = lm,col='black')
## `geom_smooth()` using formula 'y ~ x'
smoke%>%
filter(Geo_cate=='Midwest')%>%
ggplot(mapping = aes(x=Year,y=`Smoke everyday`))+
geom_point()+
geom_smooth(method = lm,col='black')
## `geom_smooth()` using formula 'y ~ x'
# Smoke Somedays
smoke%>%
filter(Geo_cate=='Northeast')%>%
ggplot(mapping = aes(x=Year,y=`Smoke some days`))+
geom_point()+
geom_smooth(method = lm,col='black')
## `geom_smooth()` using formula 'y ~ x'
smoke%>%
filter(Geo_cate=='Southwest')%>%
ggplot(mapping = aes(x=Year,y=`Smoke some days`))+
geom_point()+
geom_smooth(method = lm,col='black')
## `geom_smooth()` using formula 'y ~ x'
smoke%>%
filter(Geo_cate=='West')%>%
ggplot(mapping = aes(x=Year,y=`Smoke some days`))+
geom_point()+
geom_smooth(method = lm,col='black')
## `geom_smooth()` using formula 'y ~ x'
smoke%>%
filter(Geo_cate=='Southeast')%>%
ggplot(mapping = aes(x=Year,y=`Smoke some days`))+
geom_point()+
geom_smooth(method = lm,col='black')
## `geom_smooth()` using formula 'y ~ x'
smoke%>%
filter(Geo_cate=='Midwest')%>%
ggplot(mapping = aes(x=Year,y=`Smoke some days`))+
geom_point()+
geom_smooth(method = lm,col='black')
## `geom_smooth()` using formula 'y ~ x'
#Former Smoker
smoke%>%
filter(Geo_cate=='Northeast')%>%
ggplot(mapping = aes(x=Year,y=`Former smoker`))+
geom_point()+
geom_smooth(method = lm,col='black')
## `geom_smooth()` using formula 'y ~ x'
smoke%>%
filter(Geo_cate=='Southwest')%>%
ggplot(mapping = aes(x=Year,y=`Former smoker`))+
geom_point()+
geom_smooth(method = lm,col='black')
## `geom_smooth()` using formula 'y ~ x'
smoke%>%
filter(Geo_cate=='West')%>%
ggplot(mapping = aes(x=Year,y=`Former smoker`))+
geom_point()+
geom_smooth(method = lm,col='black')
## `geom_smooth()` using formula 'y ~ x'
smoke%>%
filter(Geo_cate=='Southeast')%>%
ggplot(mapping = aes(x=Year,y=`Former smoker`))+
geom_point()+
geom_smooth(method = lm,col='black')
## `geom_smooth()` using formula 'y ~ x'
smoke%>%
filter(Geo_cate=='Midwest')%>%
ggplot(mapping = aes(x=Year,y=`Former smoker`))+
geom_point()+
geom_smooth(method = lm,col='black')
## `geom_smooth()` using formula 'y ~ x'
#Never Smoke
smoke%>%
filter(Geo_cate=='Northeast')%>%
ggplot(mapping = aes(x=Year,y=`Never smoked`))+
geom_point()+
geom_smooth(method = lm,col='black')
## `geom_smooth()` using formula 'y ~ x'
smoke%>%
filter(Geo_cate=='Southwest')%>%
ggplot(mapping = aes(x=Year,y=`Never smoked`))+
geom_point()+
geom_smooth(method = lm,col='black')
## `geom_smooth()` using formula 'y ~ x'
smoke%>%
filter(Geo_cate=='West')%>%
ggplot(mapping = aes(x=Year,y=`Never smoked`))+
geom_point()+
geom_smooth(method = lm,col='black')
## `geom_smooth()` using formula 'y ~ x'
smoke%>%
filter(Geo_cate=='Southeast')%>%
ggplot(mapping = aes(x=Year,y=`Never smoked`))+
geom_point()+
geom_smooth(method = lm,col='black')
## `geom_smooth()` using formula 'y ~ x'
smoke%>%
filter(Geo_cate=='Midwest')%>%
ggplot(mapping = aes(x=Year,y=`Never smoked`))+
geom_point()+
geom_smooth(method = lm,col='black')
## `geom_smooth()` using formula 'y ~ x'
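The direction of each regional trend can be read from the slope of the fitted line instead of from the plot. A minimal sketch follows, assuming a stand-in data frame with two regions, a hypothetical column `everyday`, and values constructed to be exactly linear. In the analysis the same idea would apply to `smoke` and the column `Smoke everyday`.

```r
# Hypothetical stand-in for `smoke` with two regions and exactly linear trends.
smoke_demo <- data.frame(
  Geo_cate = rep(c("Northeast", "West"), each = 4),
  Year     = rep(1995:1998, times = 2),
  everyday = c(20, 19.5, 19, 18.5,    # falls by 0.5 per year
               16, 16.2, 16.4, 16.6)  # rises by 0.2 per year
)

# Fit everyday ~ Year separately within each region and keep the slope.
slopes <- sapply(split(smoke_demo, smoke_demo$Geo_cate), function(d) {
  unname(coef(lm(everyday ~ Year, data = d))["Year"])
})
print(round(slopes, 3))
```

A negative slope corresponds to a decreasing trend and a positive slope to an increasing one, in the units of the variable per year.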
dat <- c("Alabama (32.840569999605975, -86.63186000013877)",
"Alaska (64.84507999974238, -147.72205999986895)",
"Arizona (34.86596999961597, -111.76380999973156)",
"Arkansas (34.748649999697875, -92.27448999971358)",
"California (37.638300000444815, -120.99958999997835)",
"Colorado (38.842890000173554, -106.13314000041055)",
"Connecticut (41.56265999995918, -72.6498400002157)",
"Delaware (39.00883000020451, -75.57774000040052)",
"District of Columbia (38.89036999987576, -77.03195999965413)",
"Florida (28.932039999846268, -81.9289599999039)",
"Georgia (32.83967999993223, -83.62758000031658)",
"Hawaii (21.304850000427336, -157.85774999956269)",
"Idaho (43.682590000228515, -114.36368000023168)",
"Illinois (40.485010000411364, -88.99770999971656)",
"Indiana (39.76690999989677, -86.14996000035359)",
"Iowa (42.469390000048634, -93.81649000001335)",
"Kansas (38.34774000000118, -98.20077999969709)",
"Kentucky (37.645969999815804, -84.77496999996538)",
"Louisiana (31.31265999975932, -92.44567999993188)",
"Maine (45.25423000041434, -68.9850299999344)",
"Maryland (39.29057999976732, -76.6092600004485)",
"Massachusetts (42.27687000005062, -72.08269000004333)",
"Michigan (44.661320000317914, -84.71438999959867)",
"Minnesota (46.3556499998478, -94.79419999982997)",
"Mississippi (32.7455100000866, -89.53803000008429)",
"Missouri (38.63578999960896, -92.5663000000448)",
"Montana (47.06653000015956, -109.42441999998289)",
"Nebraska (41.6410400000961, -99.36572999973953)",
"Nevada (39.49323999972637, -117.07183999971608)",
"New Hampshire (43.65595000019255, -71.50036000041354)",
"New Jersey (40.13056999960594, -74.2736899996936)",
"New Mexico (34.52088000011207, -106.24057999976702)",
"New York (42.82699999955048, -75.54396999981549)",
"North Carolina (35.46624999963797, -79.1593199999179)",
"North Dakota (47.475320000018144, -100.11841999998285)",
"Ohio (40.06020999969189, -82.40426000019869)",
"Oklahoma (35.4720099999617, -97.52034999975251)",
"Oregon (44.567449999917756, -120.15502999983448)",
"Pennsylvania (40.79372999993973, -77.86069999960512)",
"Rhode Island (41.70828000002217, -71.5224700001902)",
"South Carolina (33.99855000018255, -81.0452500001872)",
"South Dakota (44.353130000049646, -100.37353000040906)",
"Tennessee (35.68094000038087, -85.77449000011325)",
"Texas (31.82724000022597, -99.42676999973554)",
"Utah (39.36070000030492, -111.58712999994941)",
"Vermont (43.625379999687425, -72.51764000028561)",
"Virginia (37.54268000028196, -78.45789000012326)",
"Washington (47.522280000022135, -120.47001000026114)",
"West Virginia (38.66550999958696, -80.71263999973604)",
"Wisconsin (44.39319000021851, -89.81636999977553)",
"Wyoming (43.23553999957147, -108.10982999975454)")
dat <- data.frame(state = dat, stringsAsFactors = FALSE)
dat_new <- data.frame(
state = gsub("\\s*\\(.+", "", dat$state, perl = TRUE),
lat = stringr::str_extract(dat$state, "(?<=\\()[0-9.-]+"),
lon = stringr::str_extract(dat$state, "[0-9.-]+(?=\\))")
)
dat_new$lon <- as.numeric(dat_new$lon)
dat_new$lat <- as.numeric(dat_new$lat)
str(dat_new)
## 'data.frame': 51 obs. of 3 variables:
## $ state: chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
## $ lat : num 32.8 64.8 34.9 34.7 37.6 ...
## $ lon : num -86.6 -147.7 -111.8 -92.3 -121 ...
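In the parsing step above, the lookbehind `(?<=\\()` anchors the first number right after the opening parenthesis (the latitude), and the lookahead `(?=\\))` anchors the second number right before the closing parenthesis (the longitude). As an alternative sketch, the same three fields can be pulled out with one pattern and capture groups in base R. This swaps stringr::str_extract for sub(), and the object names `loc`, `pattern`, and `parsed` are introduced here for illustration.

```r
# Two entries in the same "Name (lat, lon)" layout as `dat` above.
loc <- c("Alabama (32.840569999605975, -86.63186000013877)",
         "District of Columbia (38.89036999987576, -77.03195999965413)")

# One pattern with three capture groups: name, latitude, longitude.
pattern <- "^(.*?)\\s*\\(([-0-9.]+),\\s*([-0-9.]+)\\)$"

parsed <- data.frame(
  state = sub(pattern, "\\1", loc, perl = TRUE),
  lat   = as.numeric(sub(pattern, "\\2", loc, perl = TRUE)),
  lon   = as.numeric(sub(pattern, "\\3", loc, perl = TRUE))
)
print(parsed)
```

Because the pattern is anchored at both ends, a string in a different layout is returned unchanged by sub() and becomes NA in the numeric columns, which makes malformed entries easy to find.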
smoke1<-smoke
smoke1 <- left_join(smoke1,dat_new,
by = c("State" = "state"))
A. Smoking everyday
library(leaflet)
commu.pal <- colorNumeric(c('darkgreen','goldenrod','brown'), domain=smoke1$`Smoke everyday`)
leaflet(smoke1)%>%
addProviderTiles('CartoDB.VoyagerLabelsUnder')%>%
addCircles(
lat = ~lat,lng = ~lon,
label = ~paste0(round(`Smoke everyday`,2)),color = ~commu.pal(`Smoke everyday`),
opacity = 1, fillOpacity = 1, radius = 500
)%>%
addLegend('bottomleft',pal = commu.pal,values = smoke1$`Smoke everyday`,title = 'Percentage of people who smoke every day',opacity = 1)
## Warning in validateCoords(lng, lat, funcName): Data contains 64 rows with either
## missing or invalid lat/lon values and will be ignored
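The 64 ignored rows are the rows whose State value has no entry in the coordinate list: Guam, Puerto Rico, the Virgin Islands, and the two nationwide aggregates. The arithmetic can be checked against the row counts from table(smoke$State). The sketch below uses only a subset of the counts, and the vector names are introduced here for illustration.

```r
# Row counts per entry, copied from table(smoke$State) above (subset only).
state_counts <- c(
  "Alabama" = 16, "Utah" = 14,
  "Guam" = 7, "Puerto Rico" = 15, "Virgin Islands" = 10,
  "Nationwide (States and DC)" = 16,
  "Nationwide (States, DC, and Territories)" = 16
)

# Subset of the names that have coordinates in `dat_new`.
geocoded <- c("Alabama", "Utah")

# Entries with no coordinates, and the number of rows they contribute.
unmatched <- setdiff(names(state_counts), geocoded)
n_dropped <- sum(state_counts[unmatched])
print(n_dropped)
```

The total is 7 + 15 + 10 + 16 + 16 = 64, which matches the warning.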
B. Smoking some days
library(leaflet)
commu.pal <- colorNumeric(c('darkgreen','goldenrod','brown'), domain=smoke1$`Smoke some days`)
leaflet(smoke1)%>%
addProviderTiles('CartoDB.VoyagerLabelsUnder')%>%
addCircles(
lat = ~lat,lng = ~lon,
label = ~paste0(round(`Smoke some days`,2)),color = ~commu.pal(`Smoke some days`),
opacity = 1, fillOpacity = 1, radius = 500
)%>%
addLegend('bottomleft',pal = commu.pal,values = smoke1$`Smoke some days`,title = 'Percentage of people who smoke some days',opacity = 1)
## Warning in validateCoords(lng, lat, funcName): Data contains 64 rows with either
## missing or invalid lat/lon values and will be ignored
C. Former smoker
library(leaflet)
commu.pal <- colorNumeric(c('darkgreen','goldenrod','brown'), domain=smoke1$`Former smoker`)
leaflet(smoke1)%>%
addProviderTiles('CartoDB.VoyagerLabelsUnder')%>%
addCircles(
lat = ~lat,lng = ~lon,
label = ~paste0(round(`Former smoker`,2)),color = ~commu.pal(`Former smoker`),
opacity = 1, fillOpacity = 1, radius = 500
)%>%
addLegend('bottomleft',pal = commu.pal,values = smoke1$`Former smoker`,title = 'Percentage of people who are former smokers',opacity = 1)
## Warning in validateCoords(lng, lat, funcName): Data contains 64 rows with either
## missing or invalid lat/lon values and will be ignored
D. Never Smoke
library(leaflet)
commu.pal <- colorNumeric(c('darkgreen','goldenrod','brown'), domain=smoke1$`Never smoked`)
leaflet(smoke1)%>%
addProviderTiles('CartoDB.VoyagerLabelsUnder')%>%
addCircles(
lat = ~lat,lng = ~lon,
label = ~paste0(round(`Never smoked`,2)),color = ~commu.pal(`Never smoked`),
opacity = 1, fillOpacity = 1, radius = 500
)%>%
addLegend('bottomleft',pal = commu.pal,values = smoke1$`Never smoked`,title = 'Percentage of people who never smoked',opacity = 1)
## Warning in validateCoords(lng, lat, funcName): Data contains 64 rows with either
## missing or invalid lat/lon values and will be ignored
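The four map blocks differ only in the column and the legend title, so they could be wrapped in one function. The sketch below is one possible way to do that. `make_smoke_map` and `smoke1_demo` are hypothetical names, the smoking values in the demo data are invented, and the helper drops rows without coordinates up front so that the warning above does not appear.

```r
library(leaflet)

# Hypothetical helper: builds one map for any of the four smoking columns.
make_smoke_map <- function(data, column, title) {
  data <- data[!is.na(data$lat) & !is.na(data$lon), ]  # drop rows without coordinates
  values <- data[[column]]
  pal <- colorNumeric(c("darkgreen", "goldenrod", "brown"), domain = values)
  leaflet(data) %>%
    addProviderTiles("CartoDB.VoyagerLabelsUnder") %>%
    addCircles(
      lat = ~lat, lng = ~lon,
      label = as.character(round(values, 2)), color = pal(values),
      opacity = 1, fillOpacity = 1, radius = 500
    ) %>%
    addLegend("bottomleft", pal = pal, values = values, title = title, opacity = 1)
}

# Stand-in for `smoke1`; coordinates rounded from `dat`, smoking values invented.
smoke1_demo <- data.frame(
  State = c("Alabama", "Utah", "Guam"),
  lat = c(32.84057, 39.36070, NA),
  lon = c(-86.63186, -111.58713, NA),
  `Smoke everyday` = c(18.2, 9.1, 21.0),
  check.names = FALSE
)

m <- make_smoke_map(smoke1_demo, "Smoke everyday", "Smoke everyday")
```

Each state has up to 16 rows at the same coordinates in `smoke1`, one per year, so the circles are drawn on top of each other. Filtering to a single year before calling the helper would show one value per state.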
Conclusion: Based on the data analysis and plots above, smoking control policies may have become more effective between 1995 and 2010: the share of people who smoke every day fell, while the shares of former smokers and of people who never smoked rose. The same pattern largely holds in the five geographic regions (Northeast, Southwest, West, Southeast, and Midwest), with the exception of former smokers in the West, where the share showed a slight decrease.